Abstract

We define a measure of competitive performance for distributed al- 
gorithms based on throughput, the number of tasks that an algorithm 
can carry out in a fixed amount of work. This new measure com- 
plements the latency measure of Ajtai et al. [3], which measures how 
quickly an algorithm can finish tasks that start at specified times. The 
novel feature of the throughput measure, which distinguishes it from 
the latency measure, is that it is compositional: it supports a notion of 
algorithms that are competitive relative to a class of subroutines, with 
the property that an algorithm that is k-competitive relative to a class
of subroutines, combined with an l-competitive member of that class,
gives a combined algorithm that is kl-competitive.

In particular, we prove the throughput-competitiveness of a class of 
algorithms for collect operations, in which each of a group of n processes 
obtains all values stored in an array of n registers. Collects are a funda- 
mental building block of a wide variety of shared-memory distributed 
algorithms, and we show that several such algorithms are competitive 
relative to collects. Inserting a competitive collect in these algorithms 
gives the first examples of competitive distributed algorithms obtained 
by composition using a general construction.

1 Introduction



The tool of competitive analysis was proposed by Sleator and Tarjan [49] to 
study problems that arise in an on-line setting, where an algorithm is given 
an unpredictable sequence of requests to perform operations, and must make 
decisions about how to satisfy its current request that may affect how effi- 
ciently it can satisfy future requests. Since the worst-case performance of an 
on-line algorithm might depend only on very unusual or artificial sequences 
of requests, or might even be unbounded if one allows arbitrary request se- 
quences, one would like to look instead at how well the algorithm performs 
relative to some measure of difficulty for the request sequence. The key 
innovation of Sleator and Tarjan was to use as a measure of difficulty the 
performance of an optimal off-line algorithm, one allowed to see the entire 
request sequence before making any decisions about how to satisfy it. They 
defined the competitive ratio, which is the supremum, over all possible input
sequences $\sigma$, of the ratio of the performance achieved by the on-line
algorithm on $\sigma$ to the performance achieved by the optimal off-line
algorithm on $\sigma$, where the measure of performance depends on the
particular problem.

We would like to apply competitive analysis to the design of asyn- 
chronous distributed algorithms, where the input sequence may reflect both 
user commands and the timing of events in the underlying system. Our 
goal, following previous work of Ajtai et al. [3], is to find competitive algo- 
rithms for core problems in distributed computing that can then be used to 
speed up algorithms for solving more complex problems. To this end, we 
need a definition of competitive performance that permits composition: the 
construction of a competitive algorithm by combining a competitive super- 
structure with a competitive subroutine. Our efforts to find such a definition 
have led us to a notion of competitive throughput, which counts the number 
of operations or tasks that can be completed by some algorithm in a fixed 
time frame. We show that for a particular problem, the cooperative collect
problem, it is possible both (a) to obtain cooperative collect algorithms with 
good competitive throughput and (b) to use these algorithms as subrou- 
tines in many standard algorithms not specifically designed for competitive 
analysis, and thereby obtain competitive versions of these algorithms. 

We begin with a short history of competitive analysis in distributed 
algorithms (Section 1.1), followed by a discussion of the cooperative collect 
primitive (Section 1.2) and an overview of our approach and the organization 
of the rest of the paper (Section 1.3).

1.1 Competitive analysis and distributed algorithms

In a distributed setting, there are additional sources of nondeterminism other 
than the request sequence. These include process step times, request arrival 
times, message delivery times (in a message-passing system) and failures. 
Moreover, a distributed algorithm has to deal not only with the problems 
of lack of knowledge of future requests and future system behavior, but also 
with incomplete information about the current system state. Due to the ad- 
ditional type of nondeterminism in the distributed setting, it is not obvious 
how to extend the notion of competitive analysis to this environment. 

Early work on distributed job scheduling and data management [4, 16, 
17, 19-22] took the approach of comparing a distributed on-line algorithm to 
a global-control off-line algorithm. However, as noted in [17], using a global- 
control algorithm as a reference has the unfortunate side-effect of forcing the 
on-line algorithm to compete not only against algorithms that can predict 
the future but against algorithms in which each process can deduce what all 
other processes are doing at no cost. 

While such a measure can be useful for algorithms that are primarily 
concerned with managing resources, it unfairly penalizes algorithms whose 
main purpose is propagating information. Ajtai et al. [3] described a more 
refined approach in which a candidate distributed algorithm is compared 
to an optimal champion. In their competitive latency model, both the
candidate and champion are distributed algorithms. Both are subject to an
unpredictable schedule of events in the system and both must satisfy the 
same correctness condition for all possible schedules. The difference is that 
the adversary may supply a different champion optimized for each individual 
schedule when measuring performance. 

1.2 Cooperative collect 

The competitive latency model was initially designed to analyze a particular 
problem in distributed computing called cooperative collect, first abstracted 
by Saks et al. [48]. The cooperative collect problem arises in asynchronous 
shared-memory systems built from single-writer registers.^1 In order to
observe the state of the system, a process must read $n - 1$ registers, one for
each of the other processes. The simplest implementation of this operation
is to have the process carry out these $n - 1$ reads by itself. However, if many
processes are trying to read the same set of registers simultaneously, some
of the work may be shared between them.

^1 A single-writer register is one that is "owned" by some process and can only
be written to by its owner.

A collect operation is any procedure by which a process may obtain the 
values of a set of n registers (including its own). The correctness conditions
for a collect are those that follow naturally from the trivial implementation
consisting of $n - 1$ reads. The process must obtain values for each
register, and those values must be fresh, meaning that they were present in
the register at some time between when the collect started and the collect 
finished. 

Curiously, the trivial implementation is the one used in almost all of the 
many asynchronous shared-memory algorithms based on collects, including 
algorithms for consensus, snapshots, coin flipping, bounded round numbers, 
timestamps, and multi-writer registers [1, 2, 5, 6, 8, 9, 11, 12, 14, 23-25, 27-29, 
31, 33, 34, 36, 38-40, 50]. (Noteworthy exceptions are [47, 48], which present 
interesting collect algorithms that do not follow the pattern of the triv- 
ial algorithm, but which depend on making strong assumptions about the 
schedule.) Part of the reason for the popularity of this approach may be 
that the trivial algorithm is optimal in the worst case: a process running in 
isolation has no alternative but to gather all $n - 1$ register values itself.

Ajtai et al.'s [3] hope was that a cooperative collect subroutine with a 
good competitive ratio would make any algorithm that used it run faster, at 
least in situations where the competitive ratio implies that the subroutine 
outperforms the trivial algorithm. To this end, they constructed the first 
known competitive algorithm for cooperative collect and showed a bound 
on its competitive latency. Unfortunately, there are technical obstacles in 
the competitive latency model that make it impossible to prove that an 
algorithm that uses a competitive collect is itself competitive. The main 
problem is that the competitive latency includes too much information in 
the schedule: in addition to controlling the timing of events in the underlying 
system such as when register operations complete, it specifies when high- 
level operations such as collects begin. So the competitive latency model 
can only compare a high-level algorithm to other high-level algorithms that 
use collects at exactly the same times and in exactly the same ways. 

1.3 Our approach 

In the present work, we address this difficulty by replacing the competi- 
tive latency measure with a competitive throughput measure that assumes 
that the candidate and champion face the same behavior in the system, but 
breaks the connection between the tasks carried out by the candidate and 
champion algorithms. This model is described in detail in Section 3. The
intuition is that when analyzing a distributed algorithm it may be helpful
to distinguish between two sources of nondeterminism, user requests (the
input) and system behavior (the schedule). Previous work that compares a 
distributed algorithm with a global control algorithm [4, 16, 17, 19-22] im- 
plicitly makes this distinction by having the on-line and off-line algorithms 
compete only on the same input, generally hiding the details of the schedule 
in a worst-case assumption applied only to the on-line algorithm. In effect, 
these models use a competitive input but a worst-case schedule. The com- 
petitive latency model of [3] applies the same input and schedule to both 
the on-line and the off-line algorithms. In contrast, we assume a worst-case 
input but a competitive schedule. Assuming a worst-case input means an 
algorithm must perform well in any context — including as part of a larger 
algorithm. At the same time, comparing the algorithm to others with the 
same schedule gives a more refined measure of the algorithm's response to 
bad system behavior than a pure worst-case approach. 

The competitive throughput model solves the problem of comparing an 
algorithm A using a competitive subroutine B against an algorithm A* that 
uses a subroutine B* implementing the same underlying task, but it does 
not say anything about what happens when comparing A to an optimal 
A* that does not call B*. For this we need an additional tool that we call 
relative competitiveness, described in Section 4. We show (Theorem 4) that 
if an algorithm A is k-relative-competitive with respect to an underlying
subroutine B, and B is itself l-competitive, then the combined algorithm
$A \circ B$ is kl-competitive, even against optimal algorithms A* that do not
use B.

To demonstrate the applicability of these techniques, we show in Sec- 
tion 5 that the results of [3] can be extended to bound the competitive 
throughput of their algorithm; in fact, our techniques apply to any algo- 
rithm for which we have a bound on an underlying quantity that [3] called 
the collective latency, a measure of the total work needed to finish all tasks 
in progress at any given time. (This result has been used since the confer- 
ence appearance of the present work by Aspnes and Hurwood [10] to prove 
low competitive throughput for an algorithm that improves on the algorithm 
of [3].) We show in Section 6 that relative competitiveness, combined with 
a throughput-competitive collect algorithm, does in fact give throughput- 
competitive solutions to problems such as atomic snapshot [2, 5, 9, 12, 14] 
and bounded round numbers [29]. We argue that most algorithms that use 
collects can be shown to be throughput-competitive using similar techniques. 

Finally, in Section 7 we discuss some related approaches to analyzing 
distributed algorithms and consider what questions remain open.

2 Model



We use as our underlying model the wait-free shared-memory model of [37], 
using atomic single-writer multi-reader registers as the means of communi- 
cation between processes. Because the registers are atomic, we can represent 
an execution as an interleaved sequence of steps, each of which is a read or 
write of some register. The timing of events in the system is assumed to be 
under the control of an adversary, who is allowed to see the entire state of 
the system (including the internal states of the processes and the contents 
of the registers). The adversary decides at each time unit which process gets
to take the next step; these decisions are summarized in a schedule, which 
formally is just a sequence of process id's. 

The algorithms we consider implement objects, which are abstract con- 
current data structures with well-defined interfaces and correctness condi- 
tions. We assume that: 

1. The objects are manipulated by invoking tasks of some sort;

2. Each instance of a task has a well-defined initial operation and a
well-defined final operation (which may equal the initial operation for
simple tasks);

3. The definition of an initial or final operation depends only on
the operation and the preceding parts of the execution, so that the
completion of a task is recognizable at the particular step of the
execution in which its final operation is executed, without needing to
observe any subsequent part of the execution, and so that the number
of tasks completed in an execution can be defined simply by counting
the number of final operations executed; and

4. There is a predicate on object schedules that distinguishes correct
executions from incorrect executions, so that correct implementations
are defined as those whose executions always satisfy this correctness
predicate.

Beyond these minimal assumptions, the details of objects will be left un- 
specified unless we are dealing with specific applications. 

Each process has as input a request sequence specifying what tasks it 
must carry out. The request sequences are supplied by the adversary and are 
not part of the schedule, as we may wish to consider the effect of different 
request sequences while keeping the same schedule. We assume that the
request sequences are long enough that a process never runs out of tasks to
perform.

Figure 1: Throughput model. High-level operations (ovals) are implemented
as a sequence of low-level steps (circles), which take place at times
determined by the adversary. New high-level operations start as soon as previous
operations end. Payoff to the algorithm is the number of high-level operations
completed.

The performance of an algorithm is measured by its competitive through- 
put, defined in Section 3. We contrast this definition with the competitive 
latency measure of Ajtai et al. [3] in Section 3.1. Building throughput- 
competitive algorithms by composition is described in Section 4. 

3 Competitive throughput 

The competitive throughput of an algorithm measures how many tasks an 
algorithm can complete with a given schedule. The assumption is that each 
process starts a new task as soon as each previous task is finished, as shown 
in Figure 1. 

We measure the algorithm against a champion algorithm that runs un- 
der the same schedule. We do not assume that both algorithms are given 
the same request sequences; we only require that the two sets of request 
sequences be made up of tasks for the same object T. This assumption 
may seem unfair to the candidate algorithm, but it is necessary to allow 
algorithms to be composed. In reasoning about competitiveness composi- 
tionally, we compare the efficiency of a candidate B used as a subroutine in 
some higher-level algorithm A with the champion B* used as a subroutine 
in some optimal higher-level algorithm A*. In general we do not expect A
and A* to generate the same request sequences to B and B* (hence the split 
between worst-case request sequences for B and best-case for B*), but we 
can insist that both B and B* run under the same schedule. 

We start by introducing some notation. For each algorithm A, schedule
$\sigma$, and set of request sequences R, define $\mathrm{done}(A, \sigma, R)$ to be the
total number of tasks completed by all processes when running A according
to the schedule $\sigma$ and set of request sequences R. Define $\mathrm{opt}_T(\sigma)$ to be
$\max_{A^*, R^*} \mathrm{done}(A^*, \sigma, R^*)$, where A* ranges over all correct implementations
of T and R* ranges over all sets of request sequences composed of T-tasks.
(Thus, $\mathrm{opt}_T(\sigma)$ represents the performance of the best correct algorithm
running on the best-case request sequences for the fixed schedule $\sigma$.)
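
As a rough illustration of this notation, the following Python sketch counts completed tasks for a toy algorithm under a given schedule. The function name `done_fixed_cost` and the parameter `steps_per_task` are hypothetical and not part of the model above. The sketch assumes that every task costs the same fixed number of steps, so the request sequences play no role; for a real algorithm the cost of a task would depend on the schedule and on what other processes are doing.

```python
from collections import Counter

def done_fixed_cost(schedule, steps_per_task):
    """Count tasks completed under `schedule` by a toy algorithm.

    `schedule` is a sequence of process ids, one per step (the schedule
    sigma of the text). Illustrative assumption only: every task takes
    exactly `steps_per_task` steps of its own process, and a process
    starts a new task as soon as its previous one finishes.
    """
    steps = Counter(schedule)  # number of steps granted to each process
    return sum(s // steps_per_task for s in steps.values())

# Process 0 gets 4 steps, process 1 gets 3 steps, process 2 gets 1 step.
sigma = [0, 1, 0, 0, 1, 2, 0, 1]
print(done_fixed_cost(sigma, 2))  # 2 + 1 + 0 = 3 completed tasks
```

Changing only `sigma` while keeping the algorithm fixed shows how the payoff depends on the adversary's choice of schedule.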

Definition 1 Let A be an algorithm that implements an object T. Then A
is k-throughput-competitive for T if there exists a constant c such that, for
any schedule $\sigma$ and set of request sequences R,

$$\mathrm{done}(A, \sigma, R) + c \geq \frac{1}{k} \, \mathrm{opt}_T(\sigma). \tag{1}$$

This definition follows the usual definition of competitive ratio [49]. The
ratio is inverted since $\mathrm{done}(A, \sigma, R)$ measures a payoff (the number of
completed tasks) to be maximized instead of a cost to be minimized. The
constant c is included to avoid problems that would otherwise arise from 
the granularity of tasks. On very short schedules, it might be impossible for 
A to complete even a single task, even though the best A* could. Allowing 
the constant (which has minimal effect on longer schedules) gives a measure 
that more realistically describes the performance of A on typical schedules. 
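
The role of the additive constant can be seen in a small numeric check. The sketch below tests inequality (1) for a single schedule, given the two task counts as plain numbers; the function name `satisfies_definition_1` is hypothetical, and the inputs are made-up values rather than measurements of any real algorithm.

```python
def satisfies_definition_1(done, opt, k, c):
    """Check inequality (1) for one schedule: done + c >= opt / k.

    The inequality is cross-multiplied by k (assumed positive) so that
    integer inputs are compared exactly, without floating-point division.
    """
    return k * (done + c) >= opt

# On a very short schedule the candidate may finish nothing while the
# champion finishes one task; without the constant, no ratio k suffices.
print(satisfies_definition_1(done=0, opt=1, k=2, c=0))  # False
print(satisfies_definition_1(done=0, opt=1, k=2, c=1))  # True
```

Definition 1 requires the check to pass for every schedule and every set of request sequences with one fixed c, which a finite test like this can only falsify, not prove.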

3.1 Comparison with competitive latency

The competitive throughput measure was inspired by the similar competitive 
latency measure of Ajtai et al. [3]. Competitive latency is not used in this 
paper, but we will give the definition from [3] to permit direct comparison 
between the two measures. 

In the competitive latency model, the request sequences, including the 
times at which tasks start, are included in the schedule (see Figure 2). Thus 
the schedule $\sigma$ includes both user input (the request sequences) and system
timing (when each process is allowed to take a step). It is assumed that each
task runs to completion, and that the process executing the task becomes
idle until its next task starts; if the schedule calls for the process to carry
out a step in between tasks, it performs a noop. The total work done by
an algorithm A given a schedule $\sigma$, written $\mathrm{work}(A, \sigma)$, is defined as the
number of operations performed by processes outside of their idle periods. 

The competitive latency of a candidate algorithm A is defined as [3]:

$$\sup_{\sigma, A^*} \frac{\mathrm{work}(A, \sigma)}{\mathrm{work}(A^*, \sigma)},$$

where $\sigma$ ranges over all schedules in which every task demanded of A is
given enough steps to finish before a new task starts. We can rewrite this
in a form closer to that of Definition 1, by writing that an algorithm A is
k-latency-competitive if, for all schedules $\sigma$ that permit A to finish its
tasks, and all champion algorithms A*,

$$\mathrm{work}(A, \sigma) \leq k \, \mathrm{work}(A^*, \sigma).$$

Figure 2: Latency model. New tasks (ovals) start at times specified by the
schedule (vertical bars). The schedule also specifies the timing of low-level
steps (small circles). The cost to the algorithm is the number of low-level
operations actually performed (filled circles), ignoring time steps allocated
to processes in between tasks (empty circles).

Note that this definition does not include an additive constant. The
definition of competitive throughput does, to avoid problems with very short
schedules in which the candidate A cannot complete any tasks. This
problem does not arise with the competitive latency definition because of the
restriction to schedules in which A can complete all assigned tasks.

Competitive latency has some advantages over competitive throughput.
Because the request sequences are part of the schedule, it can be used to
evaluate algorithms for which some tasks are much more expensive than
others. In such a situation, the candidate in the competitive throughput
model may be stuck with hard tasks while the champion breezes through
easy ones. Competitive latency may thus be a better model than competitive
throughput for measuring the performance of algorithms in isolation, though
competitive throughput is a better measure for subroutines, as it allows
composition using the results in Section 4. Often, difficulties with varying
costs can also be ameliorated by joining cheap tasks to subsequent expensive
ones, as is done in Sections 5.1 and 6.1.

On the other hand, competitive throughput removes some awkward
features of competitive latency. In particular, the assumption that processes
are idle in between tasks (which is a necessary side-effect of specifying when
in the sequence of operations each new task starts) may not be a good rep- 
resentation of how distributed algorithms are implemented in practice. A 
further concern is that if a process becomes idle quickly (say, during a brief 
lenient period in the schedule that allows quick termination), it is unaffected 
by harsher conditions that may arise later. This means in particular that 
a candidate algorithm that is only slightly slower than the champion it is 
competing against may find itself operating under much worse conditions, 
as the schedule suddenly gets worse as soon as the champion finishes. 

The contrast between competitive latency and competitive throughput 
suggests a trade-off between competing notions of fair competition. Compet- 
itive latency treats different algorithms unfairly with respect to the sched- 
ule, by forcing a candidate to continue running in bad conditions after the 
champion has finished in good ones. But competitive throughput treats 
different algorithms unfairly with respect to the request sequences, because 
the definition explicitly assumes that the requests given to the candidate 
and champion may be different. It is not clear whether a more sophisticated 
definition could avoid both extremes, and produce a more accurate measure 
of the performance of a distributed algorithm compared to others running 
under similar conditions. 

4 Composition of competitive algorithms 

The full power of the competitive throughput measure only becomes appar- 
ent when we consider competitive algorithms built from competitive sub- 
routines. In traditional worst-case analysis, an algorithm that invokes a 
subroutine k times at a cost of at most l time units each pays a total of
at most kl time units. In a competitive framework, both the number of times
the subroutine is called and the cost of each call to the subroutine may
depend on system nondeterminism. The analogous quantity to the cost l of
each subroutine call is the competitive ratio of the subroutine. What is an 
appropriate analog of the number of times k that the subroutine is called? 

In Section 4.1, we define a notion of relative competitiveness that char- 
acterizes how well an algorithm uses a competitive subroutine. As shown in 
Section 4.2, algorithms that are k-throughput-competitive relative to an
l-throughput-competitive subroutine are themselves throughput-competitive,
with ratio kl. The definition of relative competitiveness (Definition 2) and
the composition theorem that uses it (Theorem 4) yield a method for
constructing competitive algorithms compositionally. Some examples of
applications of this method appear in Section 6. To our knowledge this is the
first example of a general composition theorem for competitive algorithms, 
even outside of a distributed setting. 

4.1 Relative competitiveness 

As in the definition of throughput-competitiveness, we consider a situation 
in which A is an algorithm implementing some object T. Here, however, we 
assume that A depends on a (possibly unspecified) subroutine implementing 
a different object U. For any specific algorithm B that implements U, we 
will write $A \circ B$ for the composition of A with B, i.e., for that algorithm
which is obtained by running B whenever A needs to carry out a U-task.^2

Definition 2 An algorithm A is k-throughput-competitive for T relative to
U if there exists a constant c such that for any B that implements U, and
any schedule $\sigma$ and request sequence R for which the ratios are defined,

$$\frac{\mathrm{done}(A \circ B, \sigma, R) + c}{\mathrm{done}(B, \sigma, R_A)} \geq \frac{1}{k} \cdot \frac{\mathrm{opt}_T(\sigma)}{\mathrm{opt}_U(\sigma)}, \tag{2}$$

where $R_A$ is the request sequence corresponding to the subroutine calls in A
when running according to R and $\sigma$.

As in the preceding definition, the additive constant c is included to avoid 
problems with granularity. The condition that the ratios are defined, which 
in essence is just a requirement that $\sigma$ be long enough for B to complete at
least one U-task, is needed for the same reason.

The condition that the ratios are defined in (2) does create a curious 
loophole in the definition of relative competitiveness: if A implements some 
object T using an object U whose tasks can never be completed by a cor- 
rect implementation, then the denominators in the inequality (2) are always 
zero, and thus A is vacuously zero-competitive relative to U. Similarly, an 
implementation B of U that never completes any tasks will be vacuously 
zero-competitive for U. Since $A \circ B$ is unlikely to be zero-competitive for T,
to apply relative competitiveness we will need to exclude such pathologies. 
We do so using the following definition of relative feasibility of objects: 



Definition 3 Let T and U be objects. Say that T is feasible relative to U
if there exists a constant c such that for all schedules $\sigma$,

$$\mathrm{opt}_T(\sigma) \leq c \cdot \mathrm{opt}_U(\sigma). \tag{3}$$

In particular, if Definition 3 holds, then in any schedule where A completes
at least one operation, $\mathrm{opt}_U$ is not zero and the right-hand side of (2) is
defined. Furthermore, if B is competitive relative to U, then
$\mathrm{done}(B, \sigma, R_A)$ is also nonzero for sufficiently long schedules, and the
left-hand side of (2) is also defined.

^2 For this definition it is important that A not execute any operations that
are not provided by U. In practice, the difficulties this restriction might
cause can often be avoided by treating U as a composite of several different
objects.

4.2 The composition theorem 

Theorem 4 describes under what conditions a relative-competitive algorithm 
combines with a competitive subroutine to yield a competitive algorithm. 

Theorem 4 Let A be an algorithm that is k-throughput-competitive for T
relative to U, where T is feasible relative to U. Let B be an
l-throughput-competitive algorithm for U. Then $A \circ B$ is
kl-throughput-competitive for T.

Proof: For the most part the proof requires only very simple algebraic 
manipulation of the definitions, but we must be careful about the constants 
and avoiding division by zero. 

Fix $\sigma$ and R. We can rewrite the inequality (2) as

$$(\mathrm{done}(A \circ B, \sigma, R) + c_A) \, \mathrm{opt}_U(\sigma) \geq \frac{1}{k} \, \mathrm{opt}_T(\sigma) \, \mathrm{done}(B, \sigma, R_A), \tag{4}$$

where $c_A$ is the constant from the definition of relative competitiveness for
A. Note that (4) holds even if one or both of the ratios in (2) is undefined,
since in that case $\mathrm{done}(B, \sigma, R_A)$ must be zero and all other quantities
are non-negative.

We can similarly rewrite (1) as

$$\mathrm{done}(B, \sigma, R_A) \geq \frac{1}{l} \, \mathrm{opt}_U(\sigma) - c_B, \tag{5}$$

where $c_B$ is a constant independent of $\sigma$ and R. Plugging (5) into the
right-hand side of (4) gives

$$(\mathrm{done}(A \circ B, \sigma, R) + c_A) \, \mathrm{opt}_U(\sigma) \geq \frac{1}{kl} \, \mathrm{opt}_T(\sigma) \, \mathrm{opt}_U(\sigma) - \frac{c_B}{k} \, \mathrm{opt}_T(\sigma),$$

which combines with the relative feasibility condition
$\mathrm{opt}_T(\sigma) \leq c_{TU} \, \mathrm{opt}_U(\sigma)$ to give

$$(\mathrm{done}(A \circ B, \sigma, R) + c_A) \, \mathrm{opt}_U(\sigma) \geq \frac{1}{kl} \, \mathrm{opt}_T(\sigma) \, \mathrm{opt}_U(\sigma) - \frac{c_B c_{TU}}{k} \, \mathrm{opt}_U(\sigma).$$

This last inequality gives the desired result, as either $\mathrm{opt}_U(\sigma) > 0$ and
we can divide out $\mathrm{opt}_U(\sigma)$, or $\mathrm{opt}_U(\sigma) = 0$ and thus
$\mathrm{opt}_T(\sigma) = 0$. In either case we have

$$\mathrm{done}(A \circ B, \sigma, R) + c_A + \frac{c_B c_{TU}}{k} \geq \frac{1}{kl} \, \mathrm{opt}_T(\sigma). \tag{6}$$

Since we have dropped no terms in this derivation, if each of the
inequalities (1), (2), and (3) used in the proof is tight, then (6) is also
tight; so the additive constant $c_A + c_B c_{TU} / k$ is the best possible that
can be obtained without using additional information.
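
To make the bookkeeping of the proof concrete, here is a small Python sketch that computes the ratio and additive constant of the composed algorithm from the parameters of its parts. The function name `composed_guarantee` and the argument names are hypothetical; `c_A`, `c_B`, and `c_TU` stand for the additive constants from relative competitiveness, from the subroutine's own competitiveness, and from relative feasibility, respectively. The numeric values are arbitrary illustrations.

```python
from fractions import Fraction

def composed_guarantee(k, l, c_A, c_B, c_TU):
    """Return (ratio, additive constant) for the composed algorithm.

    The ratio is k * l and the constant is c_A + c_B * c_TU / k, matching
    the final inequality of the proof. Integer inputs are assumed; exact
    rational arithmetic avoids rounding in the constant.
    """
    ratio = k * l
    constant = Fraction(c_A) + Fraction(c_B * c_TU, k)
    return ratio, constant

def guaranteed_tasks(opt_T, k, l, c_A, c_B, c_TU):
    """Lower bound on tasks completed by the composed algorithm."""
    ratio, constant = composed_guarantee(k, l, c_A, c_B, c_TU)
    return Fraction(opt_T, ratio) - constant

ratio, constant = composed_guarantee(k=3, l=4, c_A=2, c_B=6, c_TU=1)
print(ratio, constant)                          # 12 4
print(guaranteed_tasks(120, 3, 4, 2, 6, 1))     # 120/12 - 4 = 6
```

The ratios multiply while the constants only add (after scaling), which is why the additive terms become negligible on long schedules.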



5 Cooperative collects 

In this section, we define the write-collect object, which encapsulates the 
cooperative collect problem, and show how any cooperative collect algorithm 
satisfying certain natural criteria is throughput-competitive. 



5.1 The write-collect object 

The write-collect object acts like a set of n single-writer n-reader atomic 
registers and provides two operations for manipulating these registers. 

1. A collect operation returns the values of all of the registers, with the 
guarantee that any value returned was not overwritten before the start 
of the collect. 

2. A write-collect operation writes a new value to the process's register 
and then performs a collect. 

The write-collect operation must satisfy a rather weak serialization con- 
dition. Given two write-collects a and b: 

• If the first operation of a precedes the first operation of b, then b 
returns the value written by a as part of its vector.

• If the first operation of a follows the last operation of b, then b does
not return the value written by a. 

• If the first operation of a occurs during the execution of b, b may return 
either the value written by a, or the previous value in the register 
written to by a. 

A trivial implementation of write-collect might consist of a write followed 
immediately by n reads. 
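
A minimal sketch of this trivial implementation is given below, assuming a sequential simulation in which each operation runs to completion without interleaving. The class name `TrivialWriteCollect` and its step counter are hypothetical; the sketch does not model atomicity or an adversarial schedule, only the shape of the operation and its cost of one write plus n reads.

```python
class TrivialWriteCollect:
    """Trivial write-collect over n single-writer registers (illustrative)."""

    def __init__(self, n, initial=None):
        self.registers = [initial] * n
        self.steps = 0  # low-level reads and writes performed so far

    def _write(self, pid, value):
        # Single-writer discipline: process pid writes only register pid.
        self.registers[pid] = value
        self.steps += 1

    def _read(self, j):
        self.steps += 1
        return self.registers[j]

    def write_collect(self, pid, value):
        """One write followed immediately by n reads."""
        self._write(pid, value)
        return [self._read(j) for j in range(len(self.registers))]

obj = TrivialWriteCollect(n=3)
print(obj.write_collect(0, "a"))  # ['a', None, None]
print(obj.write_collect(2, "c"))  # ['a', None, 'c']
print(obj.steps)                  # 8: two operations of 1 write + 3 reads
```

Because the second write-collect starts after the first has begun, its result contains the value written by the first, in line with the first serialization condition.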

Our definition of the write-collect operation is motivated by the fact that 
many shared-memory algorithms execute collects interspersed with write 
operations (some examples are given in Section 6) . Treating write and collect 
as separate operations, though in many ways a more natural approach, also 
leads to difficulties in applying competitive throughput, as a candidate doing 
only expensive collects might find itself in competition with a champion 
doing only cheap writes. 

5.2 Competitive algorithms for write-collect 

To implement a write-collect, we start with the cooperative collect algorithm 
of [3]. This algorithm has several desirable properties, shown in [3]: 

1. All communication is through a set of single-writer registers, one for
each process, and the first step of each collect operation is a write. 

2. No collect operation ever requires more than 2n steps to complete. 

3. For any schedule, and any set of collects that are in progress at some 
time t, there is a bound of $O(n^{3/2} \log^2 n)$ on the total number of steps
required to complete these collects. 

These properties are what we need from a cooperative collect
implementation to prove that it gives a throughput-competitive write-collect. The first
property allows us to ignore the distinction between collect and write-collect 
operations (at least in the candidate): we can include the value written by 
the write-collect along with this initial write, and thus trivially extend a 
collect to a write-collect with no change in the behavior of the algorithm. 
In effect, our throughput-competitive write-collect algorithm is simply the 
latency-competitive collect of [3], augmented by merging the write in a write- 
collect with the first write done as part of the collect implementation. 

The last two properties give two complementary bounds on the number 
of steps needed to finish collects in progress at any given time. The bound on 
the total work to finish a set of simultaneous collects, called the collective
latency, shows that processes can combine their efforts effectively when many
are running simultaneously. The bound on the work done by any individual
process, called the private latency, applies when only a few processes are
running. We show, in Section 5.3, that any algorithm A with collective latency
CL(A) and private latency O(n) is O(√CL(A))-throughput-competitive.

For the algorithm of [3], the proof gives a competitive ratio of
O(n^{3/4} log n). This is the best algorithm currently known for doing collects
in a model in which the adversary has complete knowledge of the current 
state of the system. It is likely that better algorithms are possible, even in 
this strong model, although the analysis of cooperative collect algorithms 
can be very difficult. 

Other authors have devised faster algorithms for weaker models; see 
Section 7.3. 



5.3 Proving throughput-competitiveness 

We measure time by the total number of steps taken by all processes.
Consider some execution, and let C(t) be the set of collect operations in progress
at any time t. Each collect operation consists of a sequence of atomic read
and write operations; if, for some algorithm A, there is a bound CL(A) on
the total number of read and write operations performed by collects in C(t)
at time t or later, this bound is called the collective latency of A [3].

Define the private latency PL(A) of A as the maximum number of read
and write operations carried out by a single process during any one of its 
own collect operations. 

Our progress measure keeps track of how much of the collective latency
and private latency is used up by each read or write operation. It is
composed of two parts for each process p: the first part, M_p, tracks steps by
other processes that contribute to the collective latency of a set of collects
that includes p's current collect. The second part, N_p, simply counts the
number of steps done by p.

A step π at time t by a process q is useful for p if π is part of a collect
that started before p's current collect. In a sense, π is useful if it contributes
to the total work done by all collects in progress when p's current collect
started. A step π is extraneous for p if π occurs during an interval where p
has finished one collect operation but has not yet taken any steps as part
of a new collect operation, and π is either the first or last operation of q in
this interval. Extraneous steps do not help p in any way, but we must count
them anyway for technical reasons that will become apparent in the proof
of Lemma 6.

Let M_p(t) be the total number of useful and extraneous steps for p in
the first t steps of the execution. Let N_p(t) be the number of steps carried
out by p in the first t steps of the execution.

Lemma 5 Let A be any cooperative collect algorithm for which CL(A) and
PL(A) are bounded. Then, in any execution of A in which process p completes
a collect at time t, the total number of collects completed by p by time
t is at least F_p(t) - 1, where

\[ F_p(t) = \frac{1}{2}\left(\frac{M_p(t)}{CL(A) + 2(n-1)} + \frac{N_p(t)}{PL(A)}\right). \tag{7} \]

Proof: Observe that F_p does not decrease over time. We will show that
F_p(t) rises by at most 1 during any single collect operation of p, from which
the stated bound will follow.

Define t_0 = 0, so that F_p(t_0) = M_p(t_0) = N_p(t_0) = 0, and for each i > 0
let t_i be the time of the last step of p's i-th collect. We will bound the
increase in N_p and M_p between t_i and t_{i+1} separately.

Since p performs at most PL(A) steps during each collect, we have

\[ N_p(t_{i+1}) - N_p(t_i) \le PL(A). \]

Recall that M_p counts both useful steps for p (those operations of other
processes q that occur during a collect of p and are part of a collect of q
that started before the collect of p) and extraneous steps for p (the first and
last steps of any other process q between two successive collects of p). The
interval (t_i, t_{i+1}] includes both an initial prefix before p starts its (i+1)-th
collect and a suffix during which it carries out its (i+1)-th collect. Let s_{i+1}
be the time of the first step of p's (i+1)-th collect. Then during the initial
prefix (t_i, s_{i+1}) each other process q may carry out up to two extraneous
steps, for a total of at most 2(n - 1) extraneous steps. Useful steps occur
only during the suffix [s_{i+1}, t_{i+1}], and any useful step done by some process
q ≠ p, by definition, is part of a collect that has already started at time s_{i+1}.
Since collects in progress at time s_{i+1} perform a total of at most CL(A) steps
after time s_{i+1}, the total number of useful steps in the interval (t_i, t_{i+1}] is
at most CL(A). Adding together the useful steps and the extraneous steps
gives

\[ M_p(t_{i+1}) - M_p(t_i) \le CL(A) + 2(n-1). \]

Thus

\[ F_p(t_{i+1}) - F_p(t_i) \le \frac{1}{2}\left(\frac{CL(A) + 2(n-1)}{CL(A) + 2(n-1)} + \frac{PL(A)}{PL(A)}\right) = 1. \]
Starting with F_p(t_0) = 0, a simple induction argument then shows that
F_p(t_i) ≤ i for all i.

We now exploit the fact that F_p is nondecreasing over time (which is
immediate from the definitions of M_p and N_p and the fact that F_p is an
increasing function of M_p and N_p). Fix some time t. Let i be the largest
integer for which F_p(t) > i, so that we have F_p(t) ≤ i + 1 or i ≥ F_p(t) - 1.
If t < t_i, then F_p(t) ≤ F_p(t_i) ≤ i. Taking the contrapositive, if F_p(t) > i,
then t ≥ t_i. Since t_i is defined as the completion time of p's i-th collect, p
completes at least i ≥ F_p(t) - 1 collects by time t. ■
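The accounting in this proof can be checked numerically with a small Python sketch. It assumes the progress measure has the form F_p = (1/2)(M_p/(CL(A) + 2(n-1)) + N_p/PL(A)), which is the form suggested by the two per-collect bounds above and by the proof of Lemma 8; the function names are hypothetical and the numbers in the example are arbitrary.

```python
import math


def progress(m_p: int, n_p: int, cl: int, pl: int, n: int) -> float:
    """Assumed form of F_p: half the sum of the two normalized counters."""
    return 0.5 * (m_p / (cl + 2 * (n - 1)) + n_p / pl)


def max_gain_per_collect(cl: int, pl: int, n: int) -> float:
    """Largest possible rise in F_p over one collect of p, using the bounds
    M_p gain <= CL(A) + 2(n - 1) and N_p gain <= PL(A) from the proof."""
    return progress(cl + 2 * (n - 1), pl, cl, pl, n)


def completed_collects_lower_bound(m_p: int, n_p: int, cl: int, pl: int, n: int) -> int:
    """Lemma 5 bound F_p(t) - 1, rounded up because collect counts are integers."""
    return max(0, math.ceil(progress(m_p, n_p, cl, pl, n) - 1))


if __name__ == "__main__":
    # Arbitrary illustrative parameters: n = 8 processes, CL = 100, PL = 20.
    print(max_gain_per_collect(100, 20, 8))
    print(completed_collects_lower_bound(228, 40, 100, 20, 8))
```

The factor 1/2 is what makes the worst-case gain per collect come out to exactly 1, so that counting progress in units of F_p never overestimates completed collects by more than the single collect that may still be in flight.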

Now we must show that our progress measure rises. It is easy to see that
Σ_p N_p rises by exactly 1 per step. To show that Σ_p M_p rises, we partition
the schedule into intervals and look at how many processors are active during
each interval.

Lemma 6 Fix some collect algorithm A. Let (t_1, t_2] be any time interval,
and suppose that there are exactly m processes that carry out at least one
step during (t_1, t_2]. Then

\[ \sum_p M_p(t_2) - \sum_p M_p(t_1) \ge \binom{m}{2}. \tag{8} \]

Proof: Recall that M_p counts the total number of steps that are useful for
p or extraneous for p. It is easy to see that, for any p, M_p(t_2) - M_p(t_1) ≥ 0;
this will allow us to ignore processes that do not take steps during the interval.
For every pair of processes that both take steps during the interval, we
will show that at least one step of one of the processes is either useful or
extraneous for the other, and thus raises M_p by 1 for some p. This will give
the desired bound by counting the number of such pairs.

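Let S be the set of processes that carry out at least one step in (t_1, t_2].
Given distinct processes p_1, p_2 ∈ S, define the indicator variable m_{p_1 p_2} to
equal 1 if p_2 takes at least one step during (t_1, t_2] that is either useful or
extraneous for p_1, and 0 otherwise. Observe that for any p_1 ∈ S,

\[ M_{p_1}(t_2) - M_{p_1}(t_1) \ge \sum_{p_2 \in S,\, p_2 \ne p_1} m_{p_1 p_2}, \]

from which it follows that

\[
\begin{aligned}
\sum_p M_p(t_2) - \sum_p M_p(t_1)
  &\ge \sum_{p_1 \in S} M_{p_1}(t_2) - \sum_{p_1 \in S} M_{p_1}(t_1) \\
  &\ge \sum_{p_1 \in S} \; \sum_{p_2 \in S,\, p_2 \ne p_1} m_{p_1 p_2} \\
  &= \sum_{\{p_1, p_2\} \subseteq S,\, p_1 \ne p_2} \left( m_{p_1 p_2} + m_{p_2 p_1} \right).
\end{aligned}
\tag{9}
\]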

We will now show that, for each distinct pair of processes p_1, p_2 in S,

\[ m_{p_1 p_2} + m_{p_2 p_1} \ge 1. \]

Let p_1 and p_2 be processes that take steps in (t_1, t_2], and let C_1 and C_2
be the earliest collects of p_1 and p_2 that overlap (t_1, t_2]. Assume that C_1
starts before C_2, and consider the first step π of C_1 in (t_1, t_2]. There are
three cases, depending on when π occurs relative to C_2:

1. If π occurs before C_2, then either it is the last step of C_1 that occurs
before C_2, or there is some later step π' that is the last step that occurs
before C_2. In either case, C_1 takes a step that is extraneous for p_2,
and we have m_{p_2 p_1} ≥ 1.

2. If π occurs during C_2, it is useful for p_2, and we again have m_{p_2 p_1} ≥ 1.

3. If π occurs after the end of C_2, it either occurs outside of any collect,
in which case it is the first operation after the end of some collect by
p_2 and is extraneous for p_2, or it occurs during some later collect C_2'.
In the latter case, π is useful for p_2, since C_2' starts after C_2, and thus
after C_1. Whether π is extraneous for p_2 or useful for p_2, we still have
m_{p_2 p_1} ≥ 1.

We have just shown that m_{p_2 p_1} ≥ 1 when C_1 starts before C_2. In the
symmetric case where C_2 starts before C_1, a symmetric argument shows that
m_{p_1 p_2} ≥ 1. Since one of these two cases holds, we get m_{p_1 p_2} + m_{p_2 p_1} ≥ 1 as
claimed.

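Since m processes carry out at least one step in (t_1, t_2], there are (m choose 2)
distinct pairs of processes p_1, p_2 that each carry out at least one step in
(t_1, t_2]. We have just shown that m_{p_1 p_2} + m_{p_2 p_1} ≥ 1 for each such pair, and
so, continuing from (9),

\[
\begin{aligned}
\sum_p M_p(t_2) - \sum_p M_p(t_1)
  &\ge \sum_{\{p_1, p_2\} \subseteq S,\, p_1 \ne p_2} \left( m_{p_1 p_2} + m_{p_2 p_1} \right) \\
  &\ge \sum_{\{p_1, p_2\} \subseteq S,\, p_1 \ne p_2} 1 \\
  &= \binom{m}{2}.
\end{aligned}
\]

■
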
Turning to the champion, we can trivially bound the number of collects
completed during an interval of length n - 1 by the number of active
processes:

Lemma 7 Fix some correct collect algorithm A*. Let t_2 = t_1 + n - 1 and
suppose that there are exactly m processes that carry out at least one step
in (t_1, t_2]. Then A* completes at most m collects during (t_1, t_2].

Proof: Since a process must carry out at least one step to complete a 
collect, the only way A* can complete more than m collects during the 
interval is if some process p completes more than one collect. We will show 
that if this happens, there is an execution that demonstrates that A* is not 
correct. It follows that if A* is correct, then A* completes at most m collects 
during the interval. 

Suppose that there is such a process p, and let t_p > t_1 be the time at
which p first completes a collect during (t_1, t_2]. Then (t_p, t_2] consists of at
most n - 2 steps, and since at most one register can be read during any
one step, by the Pigeonhole Principle there exist (at least) two processes
q_1, q_2 with the property that no process reads a register owned by q_1 or q_2
during (t_p, t_2]. At least one of these two processes is not p; call this
process q.

Let v be the value that p returns for q's register from its second collect
during the interval. We will now construct a modified execution in which
v is replaced by a different value v' before this collect starts. Let ξ be the
execution of A* through time t_2, and split ξ as ξ = αβ where α is the prefix
whose last step occurs at time t_p. Because no process reads any register
owned by q in β, we can remove all steps of q in β without affecting the
execution of the other processes; let β' be the result of this removal. Now
construct an execution fragment γ, extending α, in which q runs in isolation
until it completes its current collect, and then writes a new value v' ≠ v to
its register. Because no process reads any register owned by q in β', the new
execution ξ' = αγβ' is indistinguishable from ξ by any process other than q;
in particular, p still returns v for q in its second collect, which starts after q
writes v' in γ. Thus there is an execution in which A* returns an incorrect
value, and A* is not a correct collect algorithm. ■
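The pigeonhole step in this proof is easy to illustrate in Python. The sketch below assumes, purely for illustration, that register i is owned by process i and that each step reads at most one register (with `None` standing for a step that reads nothing); the function name `unread_owners` is hypothetical.

```python
from typing import List, Optional


def unread_owners(n: int, reads: List[Optional[int]]) -> List[int]:
    """Return the processes whose registers are never read in the given steps.

    `reads` holds one entry per step: the index of the register read in that
    step, or None if the step is a write or reads nothing.
    """
    touched = {r for r in reads if r is not None}
    return [q for q in range(n) if q not in touched]


if __name__ == "__main__":
    n = 6
    # Worst case for the argument: n - 2 steps, each reading a distinct register.
    print(unread_owners(n, [0, 1, 2, 3]))
```

With at most n - 2 steps, at most n - 2 distinct registers can be read, so the returned list always has at least two entries; the proof then picks one that is not p.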

Combining Lemmas 6 and 7 gives: 

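Lemma 8 Let A be a collect algorithm for which CL(A) and PL(A) are
bounded, and let A* be any collect algorithm. Let t_2 = t_1 + n - 1 and
suppose m processes are active in (t_1, t_2]. Let F_p be the progress measure for
A as defined in (7) in Lemma 5. Let C be the number of collects completed
by A* during (t_1, t_2]. Then

\[
\frac{\sum_p \left( F_p(t_2) - F_p(t_1) \right)}{C}
  \ge \sqrt{\frac{n-1}{2\,PL(A) \cdot (CL(A) + 2(n-1))}} - \frac{1}{4(CL(A) + 2(n-1))}.
\tag{10}
\]

Proof: By Lemma 6, Σ_p M_p rises by at least (m choose 2) over the interval;
Σ_p N_p rises by exactly n - 1, the length of the interval; and by Lemma 7,
C ≤ m. Hence

\[
\begin{aligned}
\frac{\sum_p \left( F_p(t_2) - F_p(t_1) \right)}{C}
  &\ge \frac{1}{C} \cdot \frac{1}{2} \left( \frac{\binom{m}{2}}{CL(A) + 2(n-1)} + \frac{n-1}{PL(A)} \right) \\
  &\ge \frac{1}{2m} \left( \frac{\binom{m}{2}}{CL(A) + 2(n-1)} + \frac{n-1}{PL(A)} \right) \\
  &= \frac{m-1}{4(CL(A) + 2(n-1))} + \frac{1}{m} \cdot \frac{n-1}{2\,PL(A)}.
\end{aligned}
\tag{11}
\]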

This last quantity (11), treated as a function of m, is of the form
a(m - 1) + b/m, where a and b are positive constants. Thus its second
derivative is 2b/m^3, which is positive for positive m. It follows that (11) is
strictly convex when m is greater than 0, and thus that it has a unique local
minimum (and no local maxima) in the range m > 0. This local minimum is not
at m = 0, as the second term diverges. So it must be at some m > 0 at which
the first derivative vanishes.

Taking the first derivative with respect to m and setting the result to 0
shows that the unique point at which the first derivative vanishes is when

\[ \frac{1}{4(CL(A) + 2(n-1))} = \frac{1}{m^2} \cdot \frac{n-1}{2\,PL(A)}, \]

or

\[ m = \sqrt{\frac{2(n-1) \cdot (CL(A) + 2(n-1))}{PL(A)}}. \tag{12} \]

Plugging (12) into (11) and simplifying gives the right-hand side of (10), 
which, as the minimum value of (11) for all m, is a lower bound on the 
left-hand side of (10). ■ 
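The minimization can be sanity-checked numerically. The Python sketch below assumes the readings of (10), (11), and (12) that follow from the derivation, namely (11) = (m-1)/(4D) + (n-1)/(2 m PL(A)) with D = CL(A) + 2(n-1), (12) = sqrt(2(n-1)D/PL(A)), and (10) = sqrt((n-1)/(2 PL(A) D)) - 1/(4D). The function names and the sample parameters are hypothetical.

```python
import math


def eq11(m: float, n: int, cl: float, pl: float) -> float:
    """Per-interval lower bound as a function of the number m of active processes."""
    d = cl + 2 * (n - 1)
    return (m - 1) / (4 * d) + (n - 1) / (2 * m * pl)


def eq12(n: int, cl: float, pl: float) -> float:
    """Value of m at which the first derivative of eq11 vanishes."""
    d = cl + 2 * (n - 1)
    return math.sqrt(2 * (n - 1) * d / pl)


def eq10_rhs(n: int, cl: float, pl: float) -> float:
    """Claimed minimum of eq11 over all m > 0."""
    d = cl + 2 * (n - 1)
    return math.sqrt((n - 1) / (2 * pl * d)) - 1 / (4 * d)


if __name__ == "__main__":
    n, cl, pl = 64, 4000.0, 128.0  # arbitrary illustrative values
    m_star = eq12(n, cl, pl)
    print(m_star, eq11(m_star, n, cl, pl), eq10_rhs(n, cl, pl))
```

Because (11) is convex in m, evaluating it at any integer m between 1 and n can only give a value at or above the right-hand side of (10), which is what the lemma needs.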

Equation (10) effectively gives us the inverse of the competitive throughput
of A, as we can sum over all intervals in the schedule and use Lemma 5
to translate the lower bound on Σ_p F_p to a bound on the number of collects
completed by A. Asymptotically, we can simplify (10) further by noting
that CL(A) is always Ω(n), and that PL(A) is likely to be O(n) for any
reasonable collect algorithm A. We then get:

Theorem 9 Let A be a collect algorithm for which PL(A) = O(n) and
CL(A) is bounded. Then A is throughput-competitive with ratio O(√CL(A)).

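Proof: Fix a schedule σ of length t and a request sequence R. From
Lemmas 5 and 8, we have

\[
\begin{aligned}
\mathrm{done}(A, \sigma, R)
  &\ge \sum_p F_p(t) - n \\
  &\ge \mathrm{opt}(\sigma) \cdot \left( \sqrt{\frac{n-1}{2\,PL(A) \cdot (CL(A) + 2(n-1))}} - \frac{1}{4(CL(A) + 2(n-1))} \right) - n \\
  &= \mathrm{opt}(\sigma) \cdot \frac{1}{O\left(\sqrt{CL(A)}\right)} - n.
\end{aligned}
\]

The last term is subsumed by the additive constant, and we are left with
just the ratio k = O(√CL(A)). ■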

For example, applying Theorem 9 to the collect algorithm of Ajtai et
al. [3] gives a competitive throughput of O(n^{3/4} log n). Similarly, Aspnes
and Hurwood [10] give a randomized algorithm whose collective latency is
O(n log^3 n), and use an extended version of Theorem 9 to show that it has
competitive throughput O(n^{1/2} log^{3/2} n).
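Both figures are what one gets by taking the square root of the corresponding collective latency bound. The worked line below assumes a collective latency of O(n^{3/2} log^2 n) for the algorithm of [3] (the bound consistent with the ratio stated in Section 5.2) and O(n log^3 n) for the randomized algorithm of [10]; for the latter the square-root rule is applied only by analogy, since the text attributes that result to an extended version of Theorem 9.

```latex
\[
\sqrt{O\!\left(n^{3/2}\log^{2} n\right)} = O\!\left(n^{3/4}\log n\right),
\qquad
\sqrt{O\!\left(n\log^{3} n\right)} = O\!\left(n^{1/2}\log^{3/2} n\right).
\]
```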

5.4 Lower bound 

It is a trivial observation that any cooperative collect algorithm has a
collective latency of at least Ω(n), for the simple reason that completing even a
single collect operation requires reading all n registers. It follows that
Theorem 9 cannot give an upper bound on competitive throughput better than
O(√n). This turns out to be an absolute lower bound on the competitive
throughput of any deterministic collect algorithm, as shown in Theorem 10,
below.

Theorem 10 No deterministic algorithm for collect or write-collect has a
throughput competitiveness less than Ω(√n).



Proof: Fix some deterministic algorithm A. We will construct a schedule
in which A completes O(√n) collects, while an optimal algorithm completes
Ω(n). By iterating this construction, we get an arbitrarily long schedule in
which the ratio of collects completed by A to those completed by an optimal
algorithm is 1/Ω(√n). Since an arbitrarily long schedule eventually
overshadows any additive constant, it follows that the competitive throughput
of A is at least Ω(√n).

Choose a set S = {p_1, p_2, ..., p_m} of m = o(n) processes and construct a
schedule σ in which these processes (and no others) take steps in round-robin
order. During the first n - m - 1 steps of this schedule, at most n - m - 1
registers are read, so in particular there is some process p ∉ S such that no
register belonging to p is read in the first n - m - 1 steps of the execution
of A.

Extend σ to a new schedule σ' by splitting σ into segments where each
process takes two steps, and inserting m + n + 1 steps by p in between the
first and second round of steps in each segment. The result looks like this:

\[
\Bigl( p_1 p_2 \cdots p_m \; \overbrace{p\,p \cdots p}^{m+n+1} \; p_1 p_2 \cdots p_m \Bigr)^{\left\lfloor \frac{n-m-1}{2m} \right\rfloor}
\]

This new schedule σ' is indistinguishable from σ to processes in S. So
in an execution of A under σ', no process in S reads any register owned by
p, and so no process in S completes a collect. Turning to p, since p can
complete at most one collect for each n - 1 steps (the minimum time to read
fresh values), the number of collects completed by p during σ' is at most

\[
\frac{(3m + n + 1) \left\lfloor \frac{n-m-1}{2m} \right\rfloor}{n-1} = O(n/m).
\]



In contrast, a better A* might proceed as follows: during each of the
⌊(n - m - 1)/(2m)⌋ segments of σ', first p_1 through p_m write out timestamps
(and, in the case of write-collect, their inputs). Process p then gathers these
timestamps in m steps (so that it can prove that the values it reads later
are fresh). Process p uses n more steps to read the n registers, and writes
the values of these registers, marked with the timestamps, in its last step.
During the last m steps of the segment, p_1 through p_m read p's registers to
finish their collects. Thus an optimal A* finishes at least m + 1 collects per
segment, for a total of at least (m + 1)⌊(n - m - 1)/(2m)⌋ = Ω(n) collects
during σ'.
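The shape of the adversarial schedule and the two counts can be reproduced with a short Python sketch. It assumes the number of segments is the floor of (n - m - 1)/(2m), writes the processes in S as the strings "p1" through "pm" and the distinguished process as "p", and uses hypothetical function names throughout.

```python
from typing import List


def num_segments(n: int, m: int) -> int:
    """Segments obtained by cutting the first n - m - 1 steps into pieces of 2m steps."""
    return (n - m - 1) // (2 * m)


def build_sigma_prime(n: int, m: int) -> List[str]:
    """Each segment: one round of S, then m + n + 1 steps of p, then another round of S."""
    round_robin = [f"p{i}" for i in range(1, m + 1)]
    segment = round_robin + ["p"] * (m + n + 1) + round_robin
    return segment * num_segments(n, m)


def champion_collects(n: int, m: int) -> int:
    """The champion finishes m + 1 collects in every segment."""
    return (m + 1) * num_segments(n, m)


def candidate_collects_bound(n: int, m: int) -> float:
    """Only p can finish collects, at most one per n - 1 steps of the schedule."""
    return len(build_sigma_prime(n, m)) / (n - 1)


if __name__ == "__main__":
    n, m = 101, 10  # arbitrary illustrative values
    print(len(build_sigma_prime(n, m)), champion_collects(n, m), candidate_collects_bound(n, m))
```

For n = 101 and m = 10 the schedule has 4 segments of 132 steps each, so the champion completes 44 collects while the candidate is limited to roughly 5, which is the gap the rest of the proof amplifies.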

So far we have mostly demonstrated the "granularity problem" that
justifies the additive constant in Definition 1. To overcome this constant,
we need to iterate the construction of σ', after extending it further to get A
back to a state in which every process is about to start a new collect.

Observe that if a process has not yet completed a collect, it cannot do so
without executing at least one operation. Let ρ_0 be the shortest schedule of
the form σ'pp...p such that in Algorithm A, process p has finished a collect
without starting a new collect at the end of ρ_0, where p is as in the definition
of σ'. Note that if p has completed all of its collects in σ', ρ_0 will be equal
to σ', but in general ρ_0 will add as many as O(n) additional steps by p.
Note further that extending σ' to ρ_0 adds at most one additional completed
collect for A.

Similarly define, for each i in the range 1 to m, ρ_i as the shortest schedule
of the form ρ_{i-1} p_i p_i ... p_i such that p_i has finished a collect without starting
a new collect at the end of ρ_i. As before, each such extension adds at most
one additional completed collect for A, so that the total number of collects
completed by A in ρ_m is at most 1 + m more than the number completed in
σ', for a total of O(m + n/m).

This quantity is minimized when m = Θ(√n), in which case A completes
O(√n) collects during ρ_m. Because ρ_m extends σ', the number of collects
completed by A* can only increase, so A* still completes Ω(n) collects during
ρ_m.
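The choice of m can be justified by a one-line calculus argument. The derivation below treats the bound as exactly f(m) = m + n/m, ignoring the constants hidden in the O-notation, which is an assumption made only for illustration.

```latex
\[
f(m) = m + \frac{n}{m}, \qquad
f'(m) = 1 - \frac{n}{m^{2}} = 0 \iff m = \sqrt{n}, \qquad
f\!\left(\sqrt{n}\right) = 2\sqrt{n}.
\]
```

Since f''(m) = 2n/m^3 is positive for m > 0, this stationary point is the minimum, and any m within a constant factor of the square root of n gives the same asymptotic bound.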

Since at the end of ρ_m we are in a state where every process is about to
start a collect, we may repeat the construction to get a sequence of phases,
in each of which A completes O(√n) collects vs. Ω(n) for A*. Call the
schedule consisting of s such phases ρ*. Then when n is sufficiently large,
done(A, ρ*, R) ≤ sc√n for some constant c, while done(A*, ρ*, R) ≥ sc*n
for some constant c*, where R is a set of request sequences consisting only
of collect operations.

From Definition 1, A is k-throughput-competitive only if there exists a
constant c' such that for all ρ*,

\[
\mathrm{done}(A, \rho^*, R) + c' \ge \frac{1}{k}\,\mathrm{opt}_T(\rho^*) \ge \frac{1}{k}\,\mathrm{done}(A^*, \rho^*, R).
\]

Applying our previous bounds on done(A, ρ*, R) and done(A*, ρ*, R), we get

\[ s c \sqrt{n} + c' \ge \frac{1}{k}\, s c^* n, \]

and thus

\[ k \ge \frac{c^* s n}{c s \sqrt{n} + c'}. \]

Since this last inequality holds for all s, taking the limit as s goes to infinity
gives

\[ k \ge \frac{c^*}{c} \sqrt{n} = \Omega(\sqrt{n}). \]

■

Though we concentrate on deterministic algorithms in this paper, it is 
worth noting that a similar construction gives the same lower bound for 
randomized algorithms with an adaptive adversary. The main difference is 
that instead of choosing p to be the last process whose register is read, we 
choose p to have the highest expected time at which its register is first read, 
and cut off a segment when p's register is in fact read. 

6 Applications 

Armed with a throughput-competitive write-collect algorithm and
Theorem 4, it is not hard to obtain throughput-competitive versions of many
well-known shared-memory algorithms. Examples include snapshot
algorithms [2, 5, 9, 12, 14], the bounded round numbers abstraction [29],
concurrent timestamping systems [27, 28, 31, 33, 34, 39], and time-lapse
snapshot [28]. Here we elaborate on some simple examples.

6.1 Atomic snapshots 

For our purposes, a snapshot object simulates an array of n single-writer 
registers that support a scan-update operation, which writes a value to one 
of the registers (an "update") and returns a vector of values for all of the 
registers (a "scan"). A scan-update is distinguished from the weaker write- 
collect operation of Section 5.1 by a much stronger serialization condition; 
informally, this says that the vector of scanned values must appear to be a 
picture of the registers at some particular instant during the execution. As 
with write-collect, we are combining what in some implementations may be
a separate cheap operation (the update) with an expensive operation (the
scan).¹

¹ A similar combined operation appears, with its name further abbreviated to
scate, in [14].
Snapshot objects are very useful tools for constructing more complicated
shared-memory algorithms, and they have been extensively studied [2, 5, 9,
12], culminating in the protocol of Attiya and Rachman [14], which uses only
O(log n) alternating writes and collects to complete a scan-update operation,
giving O(n log n) total work.

We will apply Theorem 4 to get a competitive snapshot. Let T be a
snapshot object and U a write-collect object. Because a scan-update can be
used to simulate a write-collect or collect, we have opt_T(σ) ≤ opt_U(σ) for
any schedule σ, and so scan-update is feasible relative to write-collect.

Now let A be the Attiya-Rachman snapshot algorithm, and let B be a
throughput-competitive implementation of write-collect. Let R be a set of
request sequences consisting of scan-update operations. Since each process
in the Attiya-Rachman snapshot algorithm completes one scan-update for
every O(log n) write-collects, we have done(B, σ, R_A) ≤ O(log n) · done(A ∘
B, σ, R) + O(n log n), where the additive term accounts for write-collect
operations that are part of scan-updates that have not yet finished at the end
of σ. So we have:

\[
\frac{\mathrm{done}(A \circ B, \sigma, R) + O(n)}{\mathrm{done}(B, \sigma, R_A)}
  \ge \frac{1}{O(\log n)}
  \ge \frac{1}{O(\log n)} \cdot \frac{\mathrm{opt}_T(\sigma)}{\mathrm{opt}_U(\sigma)},
\]

since opt_T(σ)/opt_U(σ) ≤ 1. Applying Definition 2, the Attiya-Rachman
snapshot is O(log n)-throughput-competitive relative to write-collect. By
Theorem 4, plugging in any k-throughput-competitive implementation of
write-collect gives an O(k log n)-throughput-competitive snapshot protocol.
For example, if we use the O(n^{3/4} log n)-competitive protocol of Section 5.2,
we get an O(n^{3/4} log^2 n)-competitive snapshot.
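The step from the work bound to the displayed ratio can be spelled out with an explicit constant. In the derivation below, c is a hypothetical constant standing in for the O-notation, R_A denotes the write-collect requests issued by A, and done(B, σ, R_A) is assumed to be positive so that the division makes sense.

```latex
\begin{align*}
\mathrm{done}(B,\sigma,R_A)
  &\le c \log n \cdot \mathrm{done}(A \circ B,\sigma,R) + c\, n \log n \\
\Longrightarrow\quad
\mathrm{done}(A \circ B,\sigma,R) + n
  &\ge \frac{\mathrm{done}(B,\sigma,R_A)}{c \log n} \\
\Longrightarrow\quad
\frac{\mathrm{done}(A \circ B,\sigma,R) + n}{\mathrm{done}(B,\sigma,R_A)}
  &\ge \frac{1}{c \log n}
   \ge \frac{1}{c \log n} \cdot \frac{\mathrm{opt}_T(\sigma)}{\mathrm{opt}_U(\sigma)}.
\end{align*}
```

Composing with a write-collect of ratio k = O(n^{3/4} log n) then multiplies the two ratios, giving O(k log n) = O(n^{3/4} log^2 n).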

6.2 Bounded round numbers 

A large class of wait-free algorithms that communicate via single-writer 
multi-reader atomic registers have a communication structure based on asyn- 
chronous rounds. Starting from round 1, at each round, the process performs 
a computation, and then advances its round number and proceeds to the 
next round. A process's actions do not depend on its exact round number, 
but only on the distance of its current round number from those of other 
processes. Moreover, the process's actions are not affected by any process 
whose round number lags behind its own by more than a finite limit. The 
round numbers increase unboundedly over the lifetime of the system. 

Dwork, Herlihy and Waarts [29] introduced the bounded round numbers 
abstraction, which can be plugged into any algorithm that uses round num- 
bers in this fashion, transforming it into a bounded algorithm. The bounded 
round numbers implementation in [29] provides four operations of varying
difficulty; however, the use of these operations is restricted. As a result,
we can coalesce these operations into a single operation, an advance-collect,
which advances the current process's round number to the next round and
collects the round numbers of the other processes. Using their
implementation, only O(1) alternating writes and collects are needed to implement
an advance-collect.

Again we can apply Theorem 4. Let T be an object providing the
advance-collect operation, and let U be a write-collect object. Because an
advance-collect must gather information from every process in the system, it
implicitly contains a collect, and opt_T(σ) ≤ opt_U(σ) for all schedules σ. An
argument similar to that used above for the Attiya-Rachman snapshot thus
shows that plugging a k-throughput-competitive implementation of
write-collect into the Dwork-Herlihy-Waarts bounded round numbers algorithm
gives an O(k)-throughput-competitive algorithm. Using the write-collect
algorithm of Section 5.2 thus gives an O(n^{3/4} log n)-competitive algorithm.

7 Conclusions 

We have given a new measure for the competitive performance of
distributed algorithms, which improves on the competitive latency measure
of Ajtai et al. [3] by allowing such algorithms to be constructed
compositionally. We have shown that the cooperative collect algorithm of [3]
is O(n^{3/4} log n)-competitive by this measure, from which we get an
O(n^{3/4} log^2 n)-competitive atomic snapshot by modifying the protocol
of [14], and an O(n^{3/4} log n)-competitive bounded round numbers
protocol by modifying the protocol of [29]. These modifications require only
replacing the collect subroutine used in these protocols with a cooperative 
collect subroutine, and the proof of competitiveness does not require ex- 
amining the actual working of the modified protocols in detail. We believe 
that a similar straightforward substitution could give competitive versions 
of many other distributed protocols. 

We discuss some related approaches to analyzing the competitive ratio 
of distributed algorithms in Section 7.1. Some possible extensions of the 
present work are mentioned in Section 7.2. 

Finally, we note that competitive ratios of O(n^{3/4}) (up to logarithmic
factors) are not very good, but they are not too much worse than Theorem 10's
lower bound of Ω(n^{1/2}).
We describe some related work that gets closer to this bound (and, for a 
modified version of the problem, breaks it) in Section 7.3. 
7.1 Related work



A notion related to allowing only other distributed algorithms as champions 
is the very nice idea of comparing algorithms with partial information only 
against other algorithms with partial information. This was introduced by 
Papadimitriou and Yannakakis [45] in the context of linear programming; 
their model corresponds to a distributed system with no communication. A 
generalization of this approach has recently been described by Koutsoupias 
and Papadimitriou [41]. 

In addition, there is a long history of interest in optimality of a dis- 
tributed algorithm given certain conditions, such as a particular pattern of 
failures [26, 30, 35, 42-44], or a particular pattern of message delivery [13, 32, 
46]. In a sense, work on optimality envisions a fundamentally different role 
for the adversary in which it is trying to produce bad performance for both 
the candidate and champion algorithms; in contrast, the adversary used in 
competitive analysis usually cooperates with the champion. 

Nothing in the literature corresponds in generality to our notion of rel- 
ative competitiveness (Definition 2) and the composition theorem (Theo- 
rem 4) that uses it. Some examples of elegant specialized constructions of 
competitive algorithms from other competitive algorithms in a distributed 
setting are the natural potential function construction of Bartal et al. [21] 
and the distributed paging algorithm of Awerbuch et al. [18]. However, not
only do these constructions depend very much on the particular details of 
the problems being solved, but, in addition, they permit no concurrency, 
i.e. they assume that no two operations are ever in progress at the same 
time. (This assumption does not hold in general in typical distributed sys- 
tems.) In contrast, the present work both introduces a general construction 
of compositional competitive distributed algorithms and does so in the nat- 
ural distributed setting that permits concurrency. 

7.2 Variations on competitiveness 

Our work defines compositional competitiveness and relative competitive- 
ness by distinguishing between two sources of nondeterminism, one of which 
is shared between the on-line and off-line algorithms, i.e. the schedule, and 
the other is not, i.e. the input. One can define analogous notions to compo- 
sitional competitiveness and to relative competitiveness by considering any 
two sources of nondeterminism, one of which is shared between the on-line 
and off-line algorithms, and one that is not. This leads to a general notion 
of semicompetitive analysis, which has been described in a survey paper by 
the first author [7], based in part on the present work.

7.3 Improved collect algorithms

Since the appearance of the conference version of this paper, Aspnes and 
Hurwood [10] and Aumann [15] have shown that weakening some of the 
requirements of the model used here can greatly improve performance. 

In particular, Aspnes and Hurwood [10] have shown that with an ad- 
versary whose knowledge of the system state is limited, collects can be 
performed with a near-optimal O(n^{1/2} log^{3/2} n) competitive ratio in the
throughput-competitiveness model. Aumann [15] has shown that, for some 
applications, the freshness requirement can be weakened to allow a process 
to obtain a value that is out-of-date for its own collect, but that was current 
at the start of some other process's collect. He shows that with this weak- 
ened requirement an algorithm based on the Aspnes-Hurwood algorithm can 
perform collects with a competitive ratio of only O(log^n). 